If the probabilities of selection for units in a sample of a
population are unknown, one should not use the data to draw
statistical inferences about the population. To illustrate, we
examine questionable practices in forest inventory and derive
equations showing the bias that these practices can generate.

INTRODUCTION

It is well known in forest inventory practice that stratification of
the population of interest is an effective tool both in reducing the
variance in estimation and in providing information about the strata
themselves (Schreuder et al. 1993). However, it can be misused.
Suppose, for example, the strata are changed after additional
information becomes available. This can lead to serious bias in the
estimation procedure if the wrong probabilities of selection are used.

Similarly, other sample selection rules may be applied that appear to
be practical and useful but may result in unequal probabilities of
selection. Often, if this is not taken into account in estimation,
serious bias will result. In some cases, the sampling rules used in
the past are unclear, so the true probabilities of selecting the
sampling units are unknown. Using these data because they are the
best available can yield unrealistic results.

The purpose of this article is to document the danger of
inappropriate estimation when probabilities of selection are assumed
equal when they are not.


REVIEW OF LITERATURE

Both purposive sampling and probabilistic sampling have their uses.
In purposive sampling, particular sample units are selected because
the investigator thinks he or she knows what sample best represents
the population. This means that the other units in the population
have zero probability of being included in the sample. Such a sample
can be useful when a quick decision needs to be made (Schreuder and
Wood 1986; Schreuder et al. 1993, Ch. 6; and especially the example
in Schreuder 1995 based on an example by D. Basu). In probabilistic
sampling, all units have a positive probability of selection, and
these probabilities should be used in estimation for scientific
validity.

When probabilities of selection of sample units are known and used
correctly in estimation, totals for the variables of interest
measured on those units can be estimated without bias. Problems arise
when (a) these probabilities are not known because sample selection
is changed in an unquantifiable way in the field, (b) they are
assumed known when in fact the process by which units were selected
is no longer known, or (c) assumptions about the selection process
are wrong.

Williams et al. (1995) documented the potential estimation bias in
moving a cluster of subplots into the condition class of the central
subplot. This procedure was used by some Forest Inventory and
Analysis (FIA) units of the USDA Forest Service but is now
unacceptable. The contention that this bias is not important because
the estimates have never been questioned is untenable. We live in a
contentious age and scientific credibility needs to be maintained.
Finding significant bias in one study undermines the results of other
similar studies showing little or no bias.

A method used in timber cruising about 15 years ago by some National
Forests personnel was to change the basal area factor in variable
radius plot (VRP) sampling if too few sample trees were selected at a
sample point. The idea was to ensure an adequate count of, say, 4-8
trees per point. This approach was used in several other cases in the
western United States because some biometricians found in empirical
studies that the resulting bias was negligible. However, in one case
in California the procedure was found to incur appreciable bias in
estimating wood volume because the actual probabilities of selection
were modified in an unquantifiable way and differed from those
assumed in estimation (Wensel et al. 1980, Schreuder et al. 1981).

AN EXAMPLE

Emphasis in the past, when sampling public lands in the West, was on
the timber resources. The usual scenario was for the commercial
forest land base to be divided into several age classes (strata).
Each stratum was composed of homogeneous stands of areas ranging from
a few to several hundred acres, and each stand was uniquely defined
and had a known acreage attached to it. All strata were sampled
identically. An initial set of plots was used to estimate the
variability within each stratum and to estimate the total number of
plots necessary to meet a predetermined precision level. The plots
were distributed via an optimal allocation formula producing unequal
plot selection probabilities within and between the strata.
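As a rough illustration of why allocation of this kind leads to unequal selection probabilities between strata, the sketch below applies Neyman allocation, one common optimal allocation rule. The source does not say which formula was used, so the rule itself, the function name `neyman_allocation`, and the stratum sizes and standard deviations are all assumptions made for illustration.

```python
def neyman_allocation(sizes, std_devs, n_total):
    """Allocate n_total plots to strata in proportion to N_h * S_h.

    Neyman allocation is assumed here for illustration only; the source
    says just "an optimal allocation formula".
    """
    weights = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(weights)
    return [n_total * w / total for w in weights]


# Hypothetical strata: young, middle-aged, and old stands.
sizes = [5000, 3000, 2000]      # N_h: candidate plots per stratum
std_devs = [10.0, 30.0, 60.0]   # S_h: within-stratum std. dev. of volume

allocation = neyman_allocation(sizes, std_devs, n_total=130)
# Per-plot selection probability in each stratum, n_h / N_h.
probabilities = [n_h / N_h for n_h, N_h in zip(allocation, sizes)]

for n_h, pi_h in zip(allocation, probabilities):
    print(f"n_h = {n_h:.1f}, selection probability = {pi_h:.4f}")
```

In this made-up case the most variable stratum receives the most plots even though it is the smallest, so its plots carry a much higher selection probability than plots in the other strata.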


Ten years later, the population was re-stratified from three to four
strata for several reasons. Access to better aerial photography and
escalating demands for finer class separation led to more strata.
Some stands were harvested, moving them from the oldest to the
youngest class, and natural growth of the forest shifted stands to
another stratum. The size of the population also changed between the
two time periods, reflecting additional acreage due to advances in
regeneration techniques and deletions of acreage for a variety of
non-timber related reasons. Most of the original plots were
remeasured and additional plots were optimally allocated to the four
strata based on the assumption that the original plots had an equal
probability of selection in the new strata. The estimated volumes and
variances were computed assuming a stratified random sample based on
the new stratification. This resulted in a situation where the
probabilities of selecting the plots in the sample were unequal and
unknown.

Formulas for the bias incurred by assuming equal probabilities of
selection are derived in the next section. Readers who are not
interested in the statistical details can skip that section.

STATISTICAL DEVELOPMENT

Assume three ($k_1$) age-class strata within which $n(1)$ sample
plots were selected and distributed by optimal allocation. Within a
stratum, units were selected with equal probability but, to consider
a more general case, the probability of selecting unit $i$ in stratum
$k_1$ was $\pi_{1 k_1 i}$.

A proportion of the $n(1)$ sample units are remeasured at time two
and additional sampling units are also selected, to give $n$ sample
units. These $n$ units (plots) are assigned to one of four new
($k_2$) strata. For simplicity, we assume that the actual $k_2$
strata sizes are known without error, although this is not always
true: errors would make it more difficult to derive good estimates of
estimation bias.

The probability of reselecting some or all of the original $n(1)$
sample units is clearly $\pi_{1 k_1 i}$ times some factor that
reflects both the percentage of units remeasured and which of the
$k_1$ initial strata they fell in. New units would have probabilities
$\pi_{2\ell i}$ of selection for stratum $\ell$
($\ell = 1, \ldots, k_2$), provided the new units are selected
independently from the old ones. Because we allow different
probabilities of selection within the new $k_2$ strata, we denote
these probabilities by $\pi_{2\ell i}$ ($i = 1, \ldots, n$) for all
sample units, either new or remeasured.

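Let $n_\ell$ units fall in stratum $\ell$ ($\ell = 1, \ldots, k_2$),
$n = \sum_{\ell=1}^{k_2} n_\ell$, and assume we want to estimate
$Y_\ell$ = total wood volume in stratum $\ell$. Given this situation,
our unbiased estimators of volume at the current time are:

$$\hat{Y}_\ell = \sum_{j=1}^{n_\ell} y_{\ell j} / \pi_{2\ell j} \qquad [1]$$

where $y_{\ell j}$ = value of variable $y$ in stratum $\ell$,
observation $j$, and

$$\hat{Y} = \sum_{\ell=1}^{k_2} \hat{Y}_\ell \qquad [2]$$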
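Estimators [1] and [2] weight each sampled value by the inverse of its selection probability. A minimal sketch of that computation follows; the function names `ht_total` and `ht_population_total` and all plot values and probabilities are hypothetical, chosen only to show the arithmetic.

```python
def ht_total(values, probabilities):
    """Estimate a stratum total as the sum of y_j / pi_j over sampled units (eq. [1])."""
    if len(values) != len(probabilities):
        raise ValueError("values and probabilities must have equal length")
    return sum(y / pi for y, pi in zip(values, probabilities))


def ht_population_total(strata):
    """Sum the stratum estimates (eq. [2]).

    strata is a list of (values, probabilities) pairs, one per stratum.
    """
    return sum(ht_total(values, probs) for values, probs in strata)


# Hypothetical sample: plot volumes and their selection probabilities.
stratum_a = ([120.0, 80.0], [0.02, 0.01])
stratum_b = ([30.0, 45.0, 60.0], [0.05, 0.05, 0.05])

estimate = ht_population_total([stratum_a, stratum_b])
print(f"estimated total volume: {estimate:.1f}")
```

A plot selected with a small probability stands in for many unsampled plots, which is why it receives a large weight. If the recorded probabilities are wrong, the weights are wrong as well.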


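But the probabilities may no longer be known, especially for units
established at time one, because of poor record keeping. In these
situations, it is often assumed that the units within each stratum
have the same probability of selection. Under this scenario,
estimators by stratum and overall are:

$$\hat{Y}'_\ell = N_\ell \sum_{i=1}^{N_\ell} y_{\ell i}\,\delta_{\ell i} / n_\ell \qquad [3]$$

where $N_\ell$ is the number of units in stratum $\ell$,
$\delta_{\ell i} = 1$ with probability $\pi_{2\ell i}$ and
$\delta_{\ell i} = 0$ with probability $1 - \pi_{2\ell i}$, and

$$\hat{Y}' = \sum_{\ell=1}^{k_2} \hat{Y}'_\ell \qquad [4]$$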


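These estimators can be quite seriously biased because of the earlier
stated emphasis on timber volume estimation and optimal allocation to
"preferred" strata.

The expected value ($E$) of $\hat{Y}'_\ell$ can be shown to be

$$E(\hat{Y}'_\ell) = \sum_{i=1}^{N_\ell} y_{\ell i}\,\pi_{2\ell i}\,\frac{N_\ell}{n_\ell} \qquad [5]$$

which is unbiased only if $\pi_{2\ell i} = n_\ell / N_\ell$. The
estimated bias of $\hat{Y}'_\ell$ then is

$$B_\ell = E(\hat{Y}'_\ell) - Y_\ell = \frac{N_\ell}{n_\ell}\sum_{i=1}^{N_\ell} y_{\ell i}\,\pi_{2\ell i} - \sum_{i=1}^{N_\ell} y_{\ell i} \qquad [6]$$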
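The expected value in [5] and the bias in [6] can be evaluated directly once the true selection probabilities are written down. The sketch below does this for a made-up stratum of four units; the function names and all numbers are illustrative assumptions, not data from the study.

```python
def expected_naive_total(y, pi, n):
    """E(Y') from eq. [5]: (N / n) * sum of y_i * pi_i over the stratum."""
    N = len(y)
    return N / n * sum(y_i * pi_i for y_i, pi_i in zip(y, pi))


def naive_bias(y, pi, n):
    """B from eq. [6]: E(Y') minus the true stratum total."""
    return expected_naive_total(y, pi, n) - sum(y)


# Hypothetical stratum of N = 4 units from which n = 2 are sampled.
y = [10.0, 20.0, 30.0, 40.0]

equal_pi = [0.5, 0.5, 0.5, 0.5]    # pi_i = n / N for every unit
unequal_pi = [0.2, 0.3, 0.6, 0.9]  # also sums to n = 2, favors large units

bias_equal = naive_bias(y, equal_pi, n=2)      # zero: the assumption holds
bias_unequal = naive_bias(y, unequal_pi, n=2)  # positive: large units favored
print(bias_equal, bias_unequal)
```

Both probability vectors sum to the same sample size, so the two designs look alike from the sample count alone. Only the second one makes the equal-probability estimator overshoot the true total.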


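To illustrate the magnitude of bias, assume we use

$$\hat{Y}' = \sum_{\ell=1}^{k_2} \hat{Y}'_\ell \quad\text{with}\quad \hat{Y}'_\ell = N_\ell \sum_{j=1}^{n_\ell} y_{\ell j} / n_\ell$$

when in fact we should have used

$$\hat{Y}^* = \sum_{\ell=1}^{k_2} \hat{Y}^*_\ell \quad\text{with}\quad \hat{Y}^*_\ell = \sum_{j=1}^{n_\ell} y_{\ell j} / \pi_{2\ell j}.$$

For simplicity, we use $\pi_{2i}$ instead of $\pi_{2\ell i}$.

Then the bias ($B$) of estimator $\hat{Y}'$ in estimating parameter
$Y$ is (assuming $k_2 = 1$)

$$B = E(\hat{Y}') - E(\hat{Y}^*) = \sum_{i=1}^{N} y_i\left(\frac{N}{n}\pi_{2i} - 1\right) = \frac{N}{n}\sum_{i=1}^{N} y_i\,\pi_{2i} - Y \qquad [7]$$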


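If a proportion $p$ of the plots were selected with $m$ times the
probability of the other $(1-p)N$ plots ($m = 2, 3, \ldots$), then
since $\sum_{i=1}^{N} \pi_{2i} = n$,

$$\sum_{i=1}^{pN} \pi_{2i} + \sum_{j=1}^{N-pN} \pi_{2j} = n.$$

Then if $\pi_{2i} = m\pi_{2j}$ (i.e., $\pi_{2i} = m\pi$ and
$\pi_{2j} = \pi$),

$$\pi = \frac{n}{N(mp - p + 1)}.$$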
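
So $B$ can be rewritten as

$$
\begin{aligned}
B &= \frac{N}{n}\left[\sum_{i=1}^{pN} y_i\,\frac{nm}{N(mp-p+1)} + \sum_{j=1}^{N-pN} y_j\,\frac{n}{N(mp-p+1)}\right] - Y \\
  &= \sum_{i=1}^{pN} \frac{m\,y_i}{mp-p+1} + \sum_{j=1}^{N-pN} \frac{y_j}{mp-p+1} - Y \\
  &= \sum_{i=1}^{pN} \frac{(m-1)\,y_i}{mp-p+1} + \frac{Y}{mp-p+1} - Y. \qquad [8]
\end{aligned}
$$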
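As a numerical check on the algebra, the following sketch builds the two-level probabilities, evaluates the bias directly from [7], and compares it with the closed form [8]. The population values, the choice $N = 20$, $n = 5$, $p = 0.25$, $m = 4$, and the function names are arbitrary assumptions; the first $pN$ units are taken to be the high-probability ones.

```python
def two_level_probabilities(N, n, p, m):
    """Return pi_i for N units: the first p*N get m*pi, the rest get pi."""
    base = n / (N * (m * p - p + 1))
    n_high = round(p * N)
    return [m * base] * n_high + [base] * (N - n_high)


def bias_direct(y, pi, n):
    """Eq. [7]: (N / n) * sum(y_i * pi_i) - Y."""
    return len(y) / n * sum(y_i * pi_i for y_i, pi_i in zip(y, pi)) - sum(y)


def bias_closed_form(y, p, m):
    """Eq. [8]: sum over the first p*N units of (m - 1) * y_i / d, plus Y / d - Y."""
    d = m * p - p + 1
    n_high = round(p * len(y))
    total = sum(y)
    return (m - 1) * sum(y[:n_high]) / d + total / d - total


# Hypothetical population; p * N must be a whole number of units.
y = [float(v) for v in range(1, 21)]  # N = 20
N, n, p, m = len(y), 5, 0.25, 4

pi = two_level_probabilities(N, n, p, m)
print(bias_direct(y, pi, n), bias_closed_form(y, p, m))
```

Here the favored quarter of the population holds the smallest values, so both expressions give the same negative bias.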

The bias can be positive or negative depending on whether the higher
probabilities are associated with the larger or smaller $y_i$ values.
For example, for $y$ = volume, bias can be expected to be positive if
the original designs emphasized larger sample sizes in the strata
with larger trees. Similarly, one might expect negative bias for
mortality since it is often associated with smaller trees.

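The above can be illustrated with a simple application of [8]. Assume
$p = 1/2$; then

$$B = \sum_{i=1}^{N/2} \frac{(m-1)\,y_i}{\tfrac{1}{2}(m+1)} + \frac{Y}{\tfrac{1}{2}(m+1)} - Y. \qquad [9]$$

Now, to simplify [9], assume $m = 3$; then

$$B = \sum_{i=1}^{N/2} y_i + \frac{Y}{2} - Y = \sum_{i=1}^{N/2} y_i - \frac{Y}{2}. \qquad [10]$$

If the larger $y_i$ values are selected with probability $\pi$ (so
that the $N/2$ units in the sum are the smaller ones), and if we
substitute this knowledge into [10], we get

$$B = \sum_{i=1}^{N/2} y_i - \frac{Y}{2} < \frac{Y}{2} - \frac{Y}{2} = 0,$$

and if the smaller $y_i$ values are selected with probability $\pi$,
then

$$B = \sum_{i=1}^{N/2} y_i - \frac{Y}{2} > \frac{Y}{2} - \frac{Y}{2} = 0.$$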
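The sign argument can be seen with a handful of numbers. The sketch below evaluates [10] twice for one small made-up population, once with the smaller half favored and once with the larger half favored; the helper name `bias_eq10` and the six values are assumptions for illustration.

```python
def bias_eq10(y_high_probability, total):
    """Eq. [10] (p = 1/2, m = 3): sum of y over the favored half minus Y / 2."""
    return sum(y_high_probability) - total / 2


# Hypothetical population of N = 6 units.
y = [2.0, 3.0, 5.0, 8.0, 9.0, 13.0]
total = sum(y)
ordered = sorted(y)
half = len(y) // 2

bias_if_small_favored = bias_eq10(ordered[:half], total)  # small units get 3*pi
bias_if_large_favored = bias_eq10(ordered[half:], total)  # large units get 3*pi

print(bias_if_small_favored, bias_if_large_favored)
```

Because the two halves always sum to $Y$, the two biases are equal in size and opposite in sign whenever $p = 1/2$ and $m = 3$.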


PRACTICAL IMPLICATIONS

To provide some indication of how serious the bias $B$ can be, assume
that the largest units are selected with three times the probability
of smaller units ($m = 3$) and that the average size of the $N/2$
largest units is twice that of the $N/2$ smallest units. Neither
assumption is unreasonable for traditional timber-oriented surveys.
Substituting this information into [10], where the $N$ units are
sorted by $y_i$ values in decreasing order, and noting that the
overall total is $Y = N(\bar{Y}_1 + \bar{Y}_2)/2$, where $\bar{Y}_1$
is the mean for the $N/2$ largest and $\bar{Y}_2$ the mean for the
$N/2$ smallest units, we obtain

$$B = \frac{N}{2}\bar{Y}_1 - \frac{N}{4}\left(\bar{Y}_1 + \bar{Y}_2\right) = \frac{N}{4}\left(\bar{Y}_1 - \bar{Y}_2\right).$$

Since $\bar{Y}_1 = 2\bar{Y}_2$, $Y = 3N\bar{Y}_2/2$ and
$B = N\bar{Y}_2/4$. If we express $B$ as a percent of $Y$, we get
$B(\%) = \left(\tfrac{N}{4}\bar{Y}_2 / Y\right) \times 100\% \approx 17\%$.
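A small simulation gives a rough check on this figure. The sketch below draws repeated samples from a made-up population in which half the units have $y = 2$ and half have $y = 1$, with the large units three times as likely to be selected, and then applies the equal-probability estimator. Several things are assumptions rather than part of the source: Poisson sampling (each unit enters the sample independently, so the realized sample size varies), the population and sample sizes, the seed, and the function name.

```python
import random


def simulate_relative_bias(N=400, expected_n=40, m=3, reps=1000, seed=1):
    """Monte Carlo relative bias of the estimator N * (sample mean)."""
    rng = random.Random(seed)
    half = N // 2
    # Half the units are "large" (y = 2), half are "small" (y = 1).
    y = [2.0] * half + [1.0] * half
    # With p = 1/2, pi = n / (N * (m + 1) / 2); large units get m * pi.
    base = expected_n / (N * (m + 1) / 2)
    pi = [m * base] * half + [base] * half
    true_total = sum(y)

    estimates = []
    for _ in range(reps):
        # Poisson sampling (an assumption): units enter independently.
        sample = [y_i for y_i, pi_i in zip(y, pi) if rng.random() < pi_i]
        if sample:  # skip the very unlikely empty sample
            estimates.append(N * sum(sample) / len(sample))

    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - true_total) / true_total


relative_bias = simulate_relative_bias()
print(f"simulated relative bias: {relative_bias:.1%}")
```

Under these assumptions the simulated relative bias should fall close to one sixth, in line with the value derived from [10].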

Similarly serious biases can occur if the smaller $y_i$ values are
selected with higher probability.


SUMMARY 

We have shown for a simplified but realistic situation that bias in
estimation can be serious if the probabilities of selecting the
sampling units are ignored. Our recommendation is not to use data
sets for inference when these probabilities are unknown.